fix(en): improve normalization after voxpopuli results #27

Open
egenthon-cmd wants to merge 2 commits into main from feat/update-english-normalization

@egenthon-cmd egenthon-cmd commented Jun 30, 2026

Contributor

What does this PR do?

Align ref/hyp WER gaps for corpus aliases, possessives, article refs,
elided percentages, and hundred-scale numbers misheard as trailing zero.

Type of change

  • New language
  • Edit existing language (fix a replacement, tweak config, …)
  • New normalization step
  • Edit existing step (bug fix, behaviour change)
  • New preset version
  • Bug fix (other)
  • Refactor / docs / CI

Checklist

Only fill in the section(s) that match your change — delete the rest.

Edit existing language

  • New/changed word substitutions go in replacements.py, not inline in operators.py
  • If you changed a config field that can be None: the step reading it still handles None gracefully
  • Unit tests updated or added
  • E2e CSV updated if the expected output changed

How was this tested?

uv run pytest tests/

Summary by CodeRabbit

  • New Features

    • Improved English text normalization for numbers, percentages, parliamentary references, and common word aliases.
    • Better cleanup of possessive forms and spaced apostrophes after normalization.
  • Bug Fixes

    • More reliably preserves intended number formatting in tricky compound phrases.
    • Tightened space-removal behavior between digits to avoid over-collapsing in edge cases.
  • Documentation

    • Updated normalization guidance to reflect the expanded apostrophe and possessive handling.

@egenthon-cmd egenthon-cmd requested a review from karamouche June 30, 2026 17:15
coderabbitai Bot commented Jun 30, 2026

Review details
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 22ddad75-aa52-4b4a-ac00-8e6b622500ce

📥 Commits

Reviewing files that changed from the base of the PR and between 387f96a and 3d97f82.

📒 Files selected for processing (7)
  • docs/steps.md
  • normalization/languages/english/operators.py
  • normalization/presets/gladia-3.yaml
  • normalization/steps/text/remove_spaces_between_adjacent_digits.py
  • normalization/steps/text/remove_trailing_apostrophe_space.py
  • tests/unit/languages/english_voxpopuli_normalization_test.py
  • tests/unit/steps/text/remove_trailing_apostrophe_space_test.py
📝 Walkthrough

Walkthrough

Adds several English normalization fixes targeting VoxPopuli parliamentary corpus: possessive token cleanup in RemoveTrailingApostropheSpaceStep, tightened digit-space regex, parliamentary rule-to-article rewriting and percent-of insertion in operators, new alias replacements, and hundred-compound state tracking in EnglishNumberNormalizer, plus a new test module covering all these cases.

Changes

English VoxPopuli Normalization

| Layer | File(s) | Summary |
|---|---|---|
| Possessive token cleanup step and docs | `normalization/steps/text/remove_trailing_apostrophe_space.py`, `docs/steps.md` | `RemoveTrailingApostropheSpaceStep.__call__` applies a second regex to collapse orphan " s" possessive tokens to the base word; docstring and docs describe both behaviors. |
| Digit-space regex tightening | `normalization/steps/text/remove_spaces_between_adjacent_digits.py` | Adds `(?!\s+[a-z]\b)` negative lookahead to the compiled regex, preventing digit-space collapse when followed by a lowercase word token. |
| Parliamentary reference and percent-of normalization | `normalization/languages/english/operators.py` | `fix_one_word_in_numeric_contexts` now rewrites `rule <n>` to `article <n>`, loops to rejoin split subsection suffixes, and inserts `percent` when a digit is followed by `of` without another digit. |
| VoxPopuli alias replacements | `normalization/languages/english/replacements.py` | Maps `pttering`/`puttering`/`putttering`/`puttrich` → `pottering` and `guantnamo` → `guantanamo`. |
| Hundred-compound number parsing | `normalization/languages/english/number_normalizer.py` | Introduces `after_hundred`/`compound_after_hundred` state flags and a `reset_number_phrase_state()` helper; updates zeros, tens, and multipliers branches to produce thousands-scale values for "hundred … zero" compounds. |
| VoxPopuli normalization tests | `tests/unit/languages/english_voxpopuli_normalization_test.py` | New parametrized tests cover alias correction, percent-of patterns, parliamentary references, possessive cleanup, and `EnglishNumberNormalizer` hundred-compound outputs. |
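
To make the digit-space row more concrete, here is a minimal sketch of how a negative lookahead such as `(?!\s+[a-z]\b)` can veto a collapse. The base pattern, the position of the lookahead, and the name `collapse_digit_spaces` are assumptions for illustration only; the actual compiled regex in `remove_spaces_between_adjacent_digits.py` is not shown in this conversation. As written, `[a-z]\b` matches a single lowercase letter standing alone as a token, so longer words do not trigger the guard.

```python
import re

# Hypothetical base pattern: whitespace sitting between two digits.
BASE = re.compile(r"(?<=\d)\s+(?=\d)")

# Hypothetical tightened pattern: same idea, but refuse to collapse when the
# digit on the right is followed by whitespace and a lone lowercase letter.
TIGHTENED = re.compile(r"(?<=\d)\s+(?=\d(?!\s+[a-z]\b))")


def collapse_digit_spaces(text: str, pattern: re.Pattern = TIGHTENED) -> str:
    """Remove whitespace between adjacent digits using the given pattern."""
    return pattern.sub("", text)
```

Under these assumptions, `collapse_digit_spaces("10 000")` still yields `"10000"`, while `"4 5 s"` is left alone by `TIGHTENED` but becomes `"45 s"` with `BASE`.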

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • lrossillon-gladia
  • karamouche

Poem

🐇 Hoppity-hop through the corpus we go,
Fixing possessives and hundreds that grow,
"Latvia s" becomes "latvia" with flair,
"rule 42" → "article"—parliament's there!
Percent signs and pottering names fall in line,
The rabbit declares: this normalization's just fine! 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title is concise and accurately summarizes the English normalization improvements for VoxPopuli results. |
| Description check | ✅ Passed | The description matches the template and includes the change summary, type, checklist items, and testing section. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (1)
tests/unit/languages/english_voxpopuli_normalization_test.py (1)

26-38: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add a positive digit-of percentage regression.

The stack context calls out a new digit-of rewrite, but this table only exercises %, a spelled-out-number positive case, and a digit negative case. A bug in the literal-digit branch would still pass here. Please add something like ("15 of latvia population", "15 percent of latvia population").
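
A rough sketch of what the extra case could look like, written as a plain table of (input, expected) pairs so it runs without the project installed. `insert_percent_before_of` and its regex are hypothetical stand-ins for the digit-`of` branch; the real rewrite lives in `fix_one_word_in_numeric_contexts` and may differ. In the actual test module the same pairs would be passed through `pipeline.normalize` instead.

```python
import re

# Hypothetical stand-in for the digit-`of` rewrite: a number followed by
# "of" and then something that is not a digit.
_DIGIT_OF = re.compile(r"\b(\d+) of (?!\d)")


def insert_percent_before_of(text: str) -> str:
    """Insert 'percent' when a number is followed by 'of' and no digit."""
    return _DIGIT_OF.sub(r"\1 percent of ", text)


# (input, expected) pairs in the shape a parametrized test would consume.
CASES = [
    # Positive literal-digit case suggested in the review.
    ("15 of latvia population", "15 percent of latvia population"),
    # Negative case: another digit follows "of", so nothing is inserted.
    ("3 of 5 votes", "3 of 5 votes"),
]


def check_cases() -> None:
    """Assert every pair; the failing input is used as the message."""
    for text, expected in CASES:
        assert insert_percent_before_of(text) == expected, text


check_cases()
```

Note that this naive stand-in also turns `5 of the members` into `5 percent of the members`, so a negative case of that shape is worth adding alongside the positive one.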

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/languages/english_voxpopuli_normalization_test.py` around lines 26
- 38, The percent normalization test coverage in test_percent_of_patterns is
missing a positive digit-`of` case, so a bug in the literal-digit rewrite could
slip through. Extend the parametrized cases in this test to include a
digit-based input like “15 of latvia population” with the expected “15 percent
of latvia population”, so the pipeline.normalize behavior is verified for the
digit-`of` branch as well.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@normalization/languages/english/operators.py`:
- Line 195: The percent-insertion regex in operators.py is too broad and
rewrites legitimate “<number> of …” phrases, including cases that should remain
as the prior rule/article normalization handled them. Tighten the matching logic
around the existing text rewrite near the percent conversion so it only fires in
the intended spoken-percentage contexts (for example, by constraining it with
surrounding words or possessive/proper-noun patterns), and ensure the
`rule`→`article` replacement and the `re.sub(...)` percent conversion in the
same normalization flow no longer conflict on phrases like “rule 142 of the
agenda” or “5 of the members.”

In `@normalization/languages/english/replacements.py`:
- Around line 1746-1751: The VoxPopuli/parliamentary aliases in
ENGLISH_REPLACEMENTS are too broad and are rewriting valid English tokens like
puttering in all normalization paths. Move these entries out of the shared
replacements map and into a VoxPopuli-specific preset or normalization step,
keeping the general English replacement map free of corpus-specific aliases; use
the replacements module and the ENGLISH_REPLACEMENTS symbol to locate the shared
mapping.

In `@normalization/steps/text/remove_trailing_apostrophe_space.py`:
- Around line 19-20: The current remove_trailing_apostrophe_space text
normalization is over-aggressive because it collapses any 2+ letter "<word> s"
pattern, not just possessives. Update remove_trailing_apostrophe_space.py so the
trailing-s handling happens earlier in the apostrophe-stripping flow, or
restrict it to a narrower corpus-specific transform instead of the shared
text_post substitution; use the existing remove_trailing_apostrophe_space
function as the place to remove the global re.sub on "\b([a-z]{2,}) s\b" and
preserve real tokens like "letter s" and "model s".

---

Nitpick comments:
In `@tests/unit/languages/english_voxpopuli_normalization_test.py`:
- Around line 26-38: The percent normalization test coverage in
test_percent_of_patterns is missing a positive digit-`of` case, so a bug in the
literal-digit rewrite could slip through. Extend the parametrized cases in this
test to include a digit-based input like “15 of latvia population” with the
expected “15 percent of latvia population”, so the pipeline.normalize behavior
is verified for the digit-`of` branch as well.
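
The over-collapse described for `remove_trailing_apostrophe_space.py` can be reproduced in isolation. The pattern below is the one quoted in the review comment; the replacement (keeping only the base word) and the helper name `collapse_orphan_s` are assumptions based on the walkthrough summary, not the step's actual code.

```python
import re

# Pattern quoted in the review; the substitution target is an assumption.
ORPHAN_S = re.compile(r"\b([a-z]{2,}) s\b")


def collapse_orphan_s(text: str) -> str:
    """Drop a standalone 's' token that follows a word of 2+ letters."""
    return ORPHAN_S.sub(r"\1", text)


# Intended case: a split possessive left over after apostrophe stripping.
possessive = collapse_orphan_s("latvia s population")

# Unintended cases: 's' is a real token here, yet it is removed as well.
letter = collapse_orphan_s("the letter s")
model = collapse_orphan_s("tesla model s")
```

With this sketch, `possessive` is `"latvia population"`, but `letter` and `model` lose their final token too, which is why the comment suggests scoping the rule more narrowly.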

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 47c2e6b0-ea95-4762-b791-a1d1a813bc60

📥 Commits

Reviewing files that changed from the base of the PR and between d3a8062 and 387f96a.

📒 Files selected for processing (7)
  • docs/steps.md
  • normalization/languages/english/number_normalizer.py
  • normalization/languages/english/operators.py
  • normalization/languages/english/replacements.py
  • normalization/steps/text/remove_spaces_between_adjacent_digits.py
  • normalization/steps/text/remove_trailing_apostrophe_space.py
  • tests/unit/languages/english_voxpopuli_normalization_test.py

Comment thread normalization/languages/english/operators.py Outdated
Comment thread normalization/languages/english/replacements.py
Comment thread normalization/steps/text/remove_trailing_apostrophe_space.py Outdated